This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.
You can find the complete handbook on Github
This code chunk shows the loading of packages required for the analyses.
pacman::p_load(rio, # File import
here, # File locator
tidyverse, # data management + ggplot2 graphics
aweek, # working with dates
lubridate, # Manipulate dates
incidence, # an option for epicurves of linelist data
stringr, # Search and manipulate character strings
forcats, # working with factors
RColorBrewer) # Color palettes from colorbrewer2.orgTwo example datasets are used in this section:
The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.
For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.
Review the two datasets and notice the differences
Case linelist
The first 50 rows are displayed
Case counts aggregated by hospital
The first 50 rows are displayed
You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). You can then reference the data_date in the code when applying filters or in captions that auto-update.
Verify that each relevant date column is class Date and has an appropriate range of values. This for loop prints a histogram for each column.
# create character vector of column names
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))
# Produce histogram of each date column
for (Col in DateCols) { # open loop. iterate for each name in vector DateCols
hist(linelist[, Col], # print histogram of the column in linelist dataframe
breaks = 50, # number of breaks for the histogram
xlab = Col) # x-axis label is the name of the column
} # close the loopincidence packageBelow are tabs on making quick epicurves using the incidence package
CAUTION: Epicontacts expects data to be in a “linelist” format of one row per case (not aggregated). If your data is aggregated counts, look to the ggplot epicurves tab.
TIP: The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console.
2 steps are requires to plot an epicurve with the incidence package:
incidence())
A simple example - an epicurve of daily cases:
# load incidence package
library(incidence)
# create the incidence object using data by day
epi_day <- incidence(linelist$date_onset, # the linelist data
interval = "day") # the time interval
# plot the incidence object
plot(epi_day)Change time interval of case aggregation (bars)
The interval argument defines how the observations are grouped. Available options include all the options from the package aweek, including but not limited to:
Below are examples of how different intervals look when applied to the linelist.
Format and frequency of the date labels on the x-axis are the defaults for the specified interval.
# Create the incidence objects (with different intervals)
##############################
# Weekly (Monday week by default)
epi_wk <- incidence(linelist$date_onset, interval = "Monday week")
# Sunday week
epi_Sun_wk <- incidence(linelist$date_onset, interval = "Sunday week")
# Three weeks (Monday weeks by default)
epi_3wk <- incidence(linelist$date_onset, interval = "3 weeks")
# Monthly
epi_month <- incidence(linelist$date_onset, interval = "month")
# Plot the incidence objects (+ titles for clarity)
############################
plot(epi_wk)+ labs(title = "Monday weeks")
plot(epi_Sun_wk)+ labs(title = "Sunday weeks")
plot(epi_3wk)+ labs(title = "Every 3 Monday weeks")
plot(epi_month)+ labs(title = "Months")The incidence package enables modifications in the following ways:
plot() (e.g. show_cases, col_pal, alpha…)scale_x_incidence() and make_labels()ggplot() additions via the + operatorRead details in the Help files by entering ?scale_x_incidence and ?plot.incidence in the R console. Online vignettes are listed in the resources tab.
plot() modificationsA incidence plot can be modified in the following ways. Type ?plot.incidence in the R console for more details.
show_cases = If TRUE, each case is shows as a box. Best on smaller outbreaks.color = Color of case bars/boxesborder = Color of line around boxes, if show_cases = TRUEalpha = Transparency of case bars/boxes (1 is fully opaque, 0 is fully transparent)xlab = Title of x-axis (axis labels can also be applied using labs() from ggplot)ylab = Title of y-axis; defaults to user-defined incidence time intervallabels_week = Logical, indicate whether x-axis labels are in week or date format, absent other modificationsn_breaks = Number of x-axis label breaks, absent other modificationsfirst_date, last_date Dates used to trim the plotSee examples of these arguments in the subsequent tabs.
To plot the epicurve of a subset of data:
incidence() commandThe example below uses data filtered to show only cases at Central Hospital.
# filter the dataset
central_data <- linelist %>%
filter(hospital == "Central Hospital")
# create incidence object using subset of data
central_outbreak <- incidence(central_data$date_onset, interval = "week")
# plot
plot(central_outbreak) + labs(title = "Weekly case incidence at Central Hospital")TIP: Remember that date-axis labels are independent from the aggregation of the data into bars
Modify the bars
The aggregation of data into bars occurs when you set the interval = when creating the incidence object. The options for interval come from the package aweek and include options like “day”, “Monday week”, “Sunday week”, “month”, “2 weeks”, etc. See the incidence intro tab for more information.
Modify date-axis labels (frequency & format)
If working with the incidence package, you have several options to make these modifications. Some utilize the incidence package functions scale_x_date() and make_breaks(), others use the ggplot2 function scale_x_date(), and others use a combination.
DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut-off your data if the data changes!.
scale_x_incidence() onlyscale_x_incidence() from the incidence package:
interval (e.g. Sundays or Mondays)n_breaks specify number of date labels, which start from the interval of the first case.
n_breaks = nrow(i)/n (“i” is the incidence object name and “n” is a number)labels_week labels formatted as either weeks (YYYY-Www) or dates (YYYY-MM-DD)Other notes:
?scale_x_incidence into the R console to see more information.scale_x_date() to the plot will remove labels created by scale_x_incidence# create weekly incidence object (Sunday weeks)
i <- incidence(central_data$date_onset, interval = "Sunday week")
plot(i)+
scale_x_incidence(i, # name of incidence object
labels_week = F, # show dates instead of weeks
n_breaks = nrow(i)/8) # breaks every 8 weeks from week of first casescale_x_date() and make_breaks()scale_x_date() from ggplot2, but also leverage make_breaks() from incidence:
make_breaks() to define date label breaks
make_breaks() is similar to scale_x_incidence() (described above). Provide the incidence object name and optionally n_breaks as described before.scale_x_date() to the plot:
breaks = provide the breaks vector you created with make_breaks(), followed by $breaks (see example below)date_labels = provide a format for the date labels (e.g. “%d %b”) (use “” for new line)# Break modification using scale_x_date() and make_breaks()
###########################################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")
# make breaks
i_labels <- make_breaks(i, n_breaks = nrow(i)/6) # using interval from i, breaks every 6 weeks
# plot
plot(i)+
scale_x_date(breaks = i_labels$breaks, # call the breaks
date_labels = "%d\n%b '%y", # date format
date_minor_breaks = "weeks") # gridlines each week (aligns with Sundays only) scale_x_date() onlyscale_x_date() only
date_breaks = (e.g. “day”, “week”, “2 weeks”, “month”, “year”)date_minor_breaks = for vertical lines between date labelsbreaks = and to minor_breaks =date_labels = for formatting (see Dates page for tips)expand = c(0,0) to start labels at the first incidence bar. Otherwise, first label will shift depending on your specified label interval.*Note: if using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require use scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.
# Break modification using scale_x_date() only
##############################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")
# plot
plot(i)+
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y") # date labels format If you want a plot of Sunday weeks and also finely-adjusted label formats, you might find a code example helpful.
Here is an example of producing a weekly epicurve using incidence for Sunday weeks, with finely-adjusted date labels through scale_x_date():
# load packages
pacman::p_load(tidyverse, # for ggplot
incidence, # for epicurve
lubridate) # for floor_date() and ceiling_date()
# create incidence object (specifying SUNDAY weeks)
central_outbreak <- incidence(central_data$date_onset, interval = "Sunday week") # equivalent to "MMWRweek" (see US CDC)
# plot() the incidence object
plot(central_outbreak)+
### ggplot() commands added to the plot
# scale modifications
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
# sequence by 3 weeks, from Sunday before first case to Sunday after last case
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "3 weeks"),
# sequence by week, from Sunday before first case to Sunday after last case
minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
# date labels
date_labels = "%d\n%b'%y")+ # adjust how dates are displayed
scale_y_continuous(
expand = c(0,0), # remove excess space under x-axis
breaks = seq(0, 30, 5))+ # adjust y-axis intervals
# Aesthetic themes
theme_minimal()+ # simplify background
theme(
axis.title = element_text(size = 12, face = "bold"), # axis titles formatting
plot.caption = element_text(face = "italic", hjust = 0))+ # caption formatting, left-aligned
# Plot labels
labs(x = "Week of symptom onset (Sunday weeks)",
y = "Weekly case incidence",
title = "Weekly case incidence at Central Hospital",
#subtitle = "",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))To show boxes around each individual case, use the argument show_cases = TRUE in the plot() function.
Boxes around each case can be more reader-friendly, if the outbreak is of a small size. Boxes can be applied when the interval is days, weeks, or any other time period. The code below creates the weekly epicurve for a smaller outbreak (only cases from Central Hospital), with boxes around each case.
# create filtered dataset for Central Hospital
central_data <- linelist %>%
filter(hospital == "Central Hospital")
# create incidence object (weekly)
central_outbreak <- incidence(central_data$date_onset, interval = "Monday week")
# plot outbreak
plot(central_outbreak,
show_cases = T) # show boxes around individual casesThe same epicurve showing individual cases, but with other aesthetic modifications:
# add plot() arguments and ggplot() commands
plot(central_outbreak,
show_cases = T, # show boxes around each individual case
color = "lightblue", # color inside boxes
border = "darkblue", # color of border around boxes
alpha = 0.5)+ # transparency
### ggplot() commands added to the plot
# scale modifications
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "4 weeks", # labels appear every 4 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b'%y")+ # date labels format
scale_y_continuous(
expand = c(0,0), # remove excess space under x-axis
breaks = seq(0, 35, 5))+ # adjust y-axis intervals
# aesthetic themes
theme_minimal()+ # simplify background
theme(
axis.title = element_text(size = 12, face = "bold"), # axis title format
plot.caption = element_text(face = "italic", hjust = 0))+ # caption format and left-align
# plot labels
labs(x = "Week of symptom onset (Monday weeks)",
y = "Weekly reported cases",
title = "Weekly case incidence at Central Hospital",
#subtitle = "",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))To color the cases by value, provide the column to the groups = argument in the incidence() command. In the example below the cases are colored by their age category. Note the use of incidence() argument na_as_group =. If TRUE (by default) missing values (NA) will form their own group.
# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset, # date of onset for x-axis
interval = "week", # weekly aggregation of cases
groups = linelist$age_cat, # color by age_cat value
na_as_group = TRUE) # missing values assigned their own group
# plot the epicurve
plot(age_outbreak) Adjusting order
To adjust the order of group appearance (on plot and in legend), the group column must be class Factor. Adjust the order by adjusting the order of the levels (including NA). Below is an example with gender groups using data from Central Hospital only.
NA first, so it appears on the top of the barsexclude = NULL in factor() is necessary to adjust the order of NA, which is excluded by default.fill = in labs()You can read more about factors in their page (LINK)
# Create incidence object, data grouped by gender
#################################################
# Classify "gender" column as factor
####################################
# with specific level order and labels, includin for missing values
central_data <- linelist %>%
filter(hospital == "Central Hospital") %>%
mutate(gender = factor(gender,
levels = c(NA, "f", "m"),
labels = c("Missing", "Female", "Male"),
exclude = NULL))
# Create incidence object, by gender
####################################
gender_outbreak_central <- incidence(central_data$date_onset,
interval = "week",
groups = central_data$gender,
na_as_group = TRUE) # Missing values assigned their own group
# plot epicurve with modifications
##################################
plot(gender_outbreak_central,
show_cases = TRUE)+ # show box around each case
### ggplot commands added to plot
# scale modifications
scale_x_date(expand = c(0,0),
date_breaks = "6 weeks",
date_minor_breaks = "week",
date_labels = "%d %b\n%Y")+
# aesthetic themes
theme_minimal()+ # simplify plot background
theme(
legend.title = element_text(size = 14, face = "bold"),
axis.title = element_text(face = "bold"))+ # axis title bold
# plot labels
labs(fill = "Gender", # title of legend
title = "Show case boxes, with modifications",
y = "Weekly case incidence",
x = "Week of symptom onset") To change the legend
Use ggplot() commands such as:
theme(legend.position = "top") (or “bottom”, “left”, “right”)theme(legend.direction = "horizontal")theme(legend.title = element_blank()) to have no titleSee the page of ggplot() tips for more details on legends.
To specify colors manually, provide the name of the color or a character vector of multiple colors to the argument color =. Note to function properly the number of colors listed must equal the number of groups (be aware of missing values as a group)
# weekly outbreak by hospital
hosp_outbreak <- incidence(linelist$date_onset,
interval = "week",
groups = linelist$hospital,
na_as_group = FALSE) # Missing values not assigned their own group
# default colors
plot(hosp_outbreak)
# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))To change the color palette
Use the argument col_pal in plot() to change the color palette to one of the default base R palettes (do not put the name of the palette in quotes).
Other palettes include TO DO add page with palette names… To DO
# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset, # date of onset for x-axis
interval = "week", # weekly aggregation of cases
groups = linelist$age_cat, # color by age_cat value
na_as_group = TRUE) # missing values assigned their own group
# plot the epicurve
plot(age_outbreak)
# plot with different color palette
plot(age_outbreak, col_pal = rainbow)To facet the plot by a variable (make “small multiples”), see the tab on epicurves with ggplot()
ggplot2Below are tabs on using the ggplot2 package to produce epicurves from a linelist dataset.
Unlike using incidence package, you must manually control the aggregation of the data (into weeks, months, etc) and the labels on the date axis. If not carefully managed, this can lead to many headaches.
These tabs use a subset of the linelist dataset - only the cases from Central Hospital.
To produce an epicurve with ggplot() there are three main elements:
Below is perhaps the most simple code to produce daily and weekly epicurves. Axis scales and labels use default options.
# daily
ggplot(data = central_data, aes(x = date_onset)) + # x column must be class Date
geom_histogram(binwidth = 1)+ # date values binned by 1 day
labs(title = "Daily")
# weekly
ggplot(data = central_data, aes(x = date_onset)) +
geom_histogram(binwidth = 7)+ # date values binned each 7 days (arbitrary 7 days!)
labs(title = "Weekly") CAUTION: Using
binwidth = 7 starts the first bin at the first case, which could be any day of the week! To create specific Monday or Sunday weeks, see below .
To create weekly epicurves where the bins begin on a specific day of the week (e.g. Monday, Sunday), specify the histogram breaks = manually (not binwidth). This can be done by creating a sequence of dates using seq.Date() from base R. You can start/end the sequence at a specific date (as.Date("YYYY-MM-DD"), or write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such weekly breaks is below:
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days")
To achieve the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which according to the above specified arguments produces the start date of that “week”, given that the start of each week is a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is specified using the inverse ceiling_date() function to produce the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.
This code is applied to create the histogram breaks, and also the breaks for the date labels. Read more about the date labels in the Modifications tab. Defining your breaks like above will be necessary if your weekly bins are not by Monday weeks.
Below is detailed code to produce weekly epicurves for Monday and Sunday weeks. See the tab on Modifications (axes) to learn the nuances of date-axis label management.
Monday weeks
Of note:
week_start = 1) before the earliest case and to end the Monday after the last case (see explanation above).date_breaks = within scale_x_date(), which also uses Monday weeks. Sunday weeks use a different method.date_minor_breaks = within scale_x_date(), again because this plot is for Monday weeks. Sunday weeks use a different method.expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.geom_histogram()# TOTAL MONDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) +
# make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
geom_histogram(
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days"), # bins are 7-days
color = "darkblue", # color of lines around bars
fill = "lightblue") + # color of fill within bars
# x-axis labels
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # remove excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(title = "Weekly incidence of cases (Monday weeks)",
subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Mondays",
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))Sunday weeks
The below code creates a histogram of the rows, using a date column as the x-axis. Of note:
week_start = 7) before the earliest case and to end the Monday after the last case (see explanation above).breaks = and minor_breaks = within scale_x_date(). You cannot use the scale_x_date() arguments of date_breaks and date_minor_breaks as these align with Monday weeks.expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.geom_histogram()# TOTAL SUNDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) +
# For histogram, manually specify bin break points: starts the Sunday before first case, end Sunday after last case
geom_histogram(
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"), # bins are 7-days
color = "darkblue", # color of lines around bars
fill = "lightblue") + # color of fill within bars
# The labels on the x-axis
scale_x_date(expand = c(0,0),
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "3 weeks"),
minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
date_labels = "%d\n%b\n'%y")+ # day, above month abbrev., above 2-digit year
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(title = "Weekly incidence of cases (Sunday weeks)",
subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Sundays",
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))TIP: Remember that date-axis labels are independent from the aggregation of the data into bars
To modify the aggregation of data into bins/bars, do one of the following:
binwidth = within geom_histogram() - for a column of class Date, the given number is interpreted in daysbreaks = as a sequence of bin break-point datesggplot(). See the tab on aggregated counts for more information.To modify the date labels, use scale_x_date() in one of these ways:
date_breaks = to specify label frequency (e.g. “day”, “week”, “3 weeks”, “month”, or “year”)date_minor_breaks = to specify frequency of minor vertical gridlines between date labelsexpand = c(0,0) to begin the labels at the first bar (otherwise, first label will shift forward depending on specified frequency)date_labels = to specify format of date labels - see the Dates page for tips (use \n for a new line)breaks = and minor_breaks = by providing a sequence of dates for breaksdate_labels = for formatting as described aboveTo create a sequence of dates
You can use seq.Date() from base R. You can start/end the sequence at a specific date (as.Date("YYYY-MM-DD"), or write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such flexible breaks is below:
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days")
To achieve the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which according to the above specified arguments produces the start date of that “week”, given that the start of each week is a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is specified using the inverse ceiling_date() function to produce the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.
If using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require use scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.
Set maximum and minimum date values using limits = c() within scale_x_date(). E.g. scale_x_date(limits = c(as.Date("2014-04-01), NA)) sets a minimum but leaves the maximum open.
CAUTION: Caution using limits! They remove all data outside the limits, which can impact y-axis max/min, modeling, and other statistics. Strongly consider instead using limits by adding coord_cartesian() to your plot, which acts as a “zoom” without removing data.
DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut-off your data if the data changes!.
https://rdrr.io/r/base/strptime.html —– see all % shortcuts
Below is a demonstration of some plots where the bins and the plot labels/gridlines are aligned and not aligned:
# 7-day binwidth and colors specified, axes flush with labels
#################
ggplot(central_data, aes(x = date_onset)) + # x column must be class Date
geom_histogram(
binwidth = 7, # 7 days per bin (! starts at first case!)
color = "darkblue", # color of lines around bars
fill = "lightblue") + # color of bar fill
labs(
title = "MISALIGNED",
subtitle = "!CAUTION: 7-day bars start Thursdays with first case\ndefault axis labels/ticks not aligned")
# specify date_breaks 3 WEEKS
#############################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
binwidth = 7, # 7-day bins with start at first case
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # Monday every 3 weeks
date_minor_breaks = "week", # Monday weeks
date_labels = "%d\n%b\n'%y")+ # label format
scale_y_continuous(
expand = c(0,0))+ # remove excess space under x-axis, make flush with labels
labs(
title = "MISALIGNED",
subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nDate labels and gridlines on Mondays")
# specify date_breaks MONTHS (note each bin begins 1st of month)
################################################################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
binwidth = 7,
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "months", # 1st of month
date_minor_breaks = "week", # Monday weeks
date_labels = "%d\n%b\n'%y")+ # label format
scale_y_continuous(
expand = c(0,0))+ # remove excess space under x-axis, make flush with labels
labs(
title = "MISALIGNED",
subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nGridlines at 1st of each month (with labels) and weekly on Mondays\nLabels on 1st of each month")
# TOTAL MONDAY ALIGNMENT: specify manual bin breaks to be mondays
#################################################################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
# histogram breaks set to 7 days beginning Monday before first case
breaks = seq.Date(
from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days"),
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # Monday every 3 weeks
date_minor_breaks = "week", # Monday weeks
date_labels = "%d\n%b\n'%y")+ # label format
labs(
title = "ALIGNED Mondays",
subtitle = "7-day bins manually set to begin Monday before first case (28 Apr)\nDate labels and gridlines on Mondays as well")
# TOTAL SUNDAY ALIGNMENT: specify manual bin breaks AND labels to be Sundays
############################################################################
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
# histogram breaks set to 7 days beginning Sunday before first case
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
color = "darkblue",
fill = "lightblue") +
scale_x_date(
expand = c(0,0),
# date label breaks set to every 3 weeks beginning Sunday before first case
breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "3 weeks"),
# gridlines set to weekly beginning Sunday before first case
minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 7)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
by = "7 days"),
date_labels = "%d\n%b\n'%y")+ # label format
labs(title = "ALIGNED Sundays",
subtitle = "7-day bins manually set to begin Sunday before first case (27 Apr)\nDate labels and gridlines manually set to Sundays as well")
# Check values of bars by creating dataframe of grouped values
# central_tab <- central_data %>%
# mutate(week = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
# group_by(week, .drop=F) %>%
# summarize(n = n()) %>%
# mutate(groups_3wk = 1:(nrow(central_tab)+1) %/% 3) %>%
# group_by(groups_3wk) %>%
# summarize(n = n())Designate a column containing groups
In any of the code template (Sunday weeks, Monday weeks), make the following changes:
aes() within the geom_histogram() (don’t forget comma afterward)aes(), provide the grouping column name to group = and fill = (no quotes needed). group is necessary, while fill changes the color of the bar.fill = argument outside of the aes(), as it will override the one insideaes() will apply by group, whereas any outside will apply to all bars (e.g. you may want color = outside, so each bar has the same color perimeter/border)geom_histogram(
aes(group = gender, fill = gender))
Adjust colors:
scale_fill_manual() (note scale_color_manual() is different!).
values = argument to apply a vector of colors.na.value = to specify a color for missing values.labels = argument in scale_fill_manual() change the legend text labels - it is easy to accidentally give labels in the incorrect order and have an incorrect legend! It is recommended to instead convert the group column to class Factor and designate factor labels and order, as explained below.Adjust the stacking order and Legend
Stacking order, and the labels for each group in the legend, is best adjusted by classifying the group column as class Factor. You can then designate the levels and their labels, and the order (which is reflected in stack order).
Step 1: Before making the ggplot, convert the group column to class Factor using factor() from base R.
For example, with a column “gender” with values “m” and “f” and NA, this can be put in a mutate() command as:
dataset <- dataset %>%
mutate(gender = factor(gender,
levels = c(NA, "f", "m"),
labels = c("Missing", "Female", "Male"),
exclude = NULL))
The above code establishes the levels, in the ordering that missing values are “first” (and will appear on top). Then the labels that will show are given in the same order. Lastly, the exclude statement ensures that NA is included in the ordering (by default factor() ignores NA).
Read more about factors in their dedicated handbook page (LINK).
Adjusting the legend
Read more about legends in the ggplot tips page. Here are a few highlights:
theme(legend.position = "top") (or “bottom”, “left”, “right”)theme(legend.direction = "horizontal")theme(legend.title = element_blank()) to have no titleSee the page of ggplot() tips for more details on legends.
These steps are shown in the example below:
########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(
from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days") # bins are 7-days
# Set gender as factor and missing values as first level (to show on top)
central_data <- linelist %>%
filter(hospital == "Central Hospital") %>%
mutate(gender = factor(
gender,
levels = c(NA, "f", "m"),
labels = c("Missing", "Female", "Male"),
exclude = NULL))
# make plot
###########
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
aes(group = gender, fill = gender), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# The labels on the x-axis
scale_x_date(
expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(
expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
#scale of colors and legend labels
scale_fill_manual(
values = c("grey", "orange", "purple"))+ # specify fill colors ("values") - attention to order!
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(
plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(
title = "Weekly incidence of cases, by gender",
subtitle = "Subtitle",
fill = "Gender", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))Display bars side-by-side
Side-by-side display of group bars (as opposed to stacked) is specified within geom_histogram() with position = "dodge".
If there are more than two value groups, these can become difficult to read. Consider instead using a faceted plot (small multiples) (see tab). To improve readability in this example, missing gender values are removed.
########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T), "week", week_start = 1)),
to = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
by = "7 days") # bins are 7-days
# New dataset without rows missing gender
central_data_dodge <- linelist %>%
filter(hospital == "Central Hospital") %>%
filter(!is.na(gender)) %>% # remove rows missing gender
mutate(gender = factor(gender, # factor now has only two levels (missing not included)
levels = c("f", "m"),
labels = c("Female", "Male")))
# make plot
###########
ggplot(central_data_dodge, aes(x = date_onset)) +
geom_histogram(
aes(group = gender, fill = gender), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks,
position = "dodge")+ # see breaks defined above
# The labels on the x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "3 weeks", # labels appear every 3 Monday weeks
date_minor_breaks = "week", # vertical lines appear every Monday week
date_labels = "%d\n%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
#scale of colors and legend labels
scale_fill_manual(values = c("pink", "lightblue"))+ # specify fill colors ("values") - attention to order!
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"))+ # axis titles in bold
# labels
labs(title = "Weekly incidence of cases, by gender",
subtitle = "Subtitle",
fill = "Gender", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))As with other ggplots, you can create facetted plots (“small multiples”) off values in a column. As explained in the ggplot tips page of this handbook, you can use either:
facet_wrap()facet_grid()For epicurves, facet_wrap() is typically easiest as it is likely that you only need to facet on one column. The general syntax is facet_wrap(rows ~ cols), where to the left of the tilde (~) is the name of a column to be spread across the “rows” of the new plot, and to the right of the tilde is the name of a column to be spread across the “columns” of the new plot.
Most simply, just use one column name, to the right of the tilde: facet_wrap(~age_cat).
Free axes
You will need to decide whether the scales (scales =) of the axes for each facet are “fixed” to the same dimensions (default), or “free” (meaning they will change based on the data within the facet). You can also specify “free_x” or “free_y” to release in only one dimension.
Number of cols and rows
This can be specified with ncol = and nrow = within facet_wrap().
Order of panels
To change the order of appearance, change the underlying order of the levels of the factor column used to create the facets.
Aesthetics
Font size and face, strip color, etc. can be modified through theme() with arguments like:
strip.text = element_text() (size, colour, face, angle…)strip.background = element_rect() (e.g. element_rect(fill=“red”))The position of the strip can be modified as the strip.position = argument within facet_wrap() (e.g. “bottom”, “top”, “left”, “right”)
Strip labels
Labels of the facet plots can be modified through the use of a “labeller”. Make a labeller like this, using the function as_labeller() from ggplot2:
my_labels <- as_labeller(c(
"0-4" = "Ages 0-4",
"5-9" = "Ages 5-9",
"10-14" = "Ages 10-14",
"15-19" = "Ages 15-19",
"20-29" = "Ages 20-29",
"30-49" = "Ages 30-49",
"50-69" = "Ages 50-69",
"70+" = "Over age 70"))An example plot
Faceted by column age_cat
# make plot
###########
ggplot(central_data, aes(x = date_onset)) +
geom_histogram(
aes(group = age_cat, fill = age_cat), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# The labels on the x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "2 months", # labels appear every 2 months
date_minor_breaks = "1 month", # vertical lines appear every 1 month
date_labels = "%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"),
legend.position = "bottom",
strip.text = element_text(face = "bold", size = 10),
strip.background = element_rect(fill = "grey"))+ # axis titles in bold
# create facets
facet_wrap(~age_cat,
ncol = 4,
strip.position = "top",
labeller = my_labels)+
# labels
labs(title = "Weekly incidence of cases, by age category",
subtitle = "Subtitle",
fill = "Age category", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown")) See this link for more information on labellers.
Add total epidemic to background
Add a separate geom_histogram() command before the current one. Specify that the data used is the data without the column used for faceting (see select()). Then, specify a color like “grey” and a degree of transparency to make it appear in the background.
geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+
Note that the y-axis maximum is now based on the height of the entire epidemic.
ggplot(central_data, aes(x = date_onset)) +
# for background shadow of whole outbreak
geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+
# actual epicurves by group
geom_histogram(
aes(group = age_cat, fill = age_cat), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# Labels on x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "2 months", # labels appear every 2 months
date_minor_breaks = "1 month", # vertical lines appear every 1 month
date_labels = "%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"),
legend.position = "bottom",
strip.text = element_text(face = "bold", size = 10),
strip.background = element_rect(fill = "white"))+ # axis titles in bold
# create facets
facet_wrap(~age_cat, # each plot is one value of age_cat
ncol = 4, # number of columns
strip.position = "top", # position of the facet title/strip
labeller = my_labels)+ # labeller defines above
# labels
labs(title = "Weekly incidence of cases, by age category",
subtitle = "Subtitle",
fill = "Age category", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))Create one facet with ALL data
To do this, you duplicate all the data (double the number of rows in the dataset) and in the faceted column have a new value (e.g. “all”) which indicates all the duplicated rows. A helped function is below that enables this:
# Define helper function
CreateAllFacet <- function(df, col){
df$facet <- df[[col]]
temp <- df
temp$facet <- "all"
merged <-rbind(temp, df)
# ensure the facet value is a factor
merged[[col]] <- as.factor(merged[[col]])
return(merged)
}
# Create dataset that is duplicated, to show "all zones" as another facet level
central_data2 <- CreateAllFacet(central_data, col = "age_cat") %>%
mutate(facet = factor(facet,
levels = c("all", "0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70+")))
# check
table(central_data2$facet, useNA = "always")
##
## all 0-4 5-9 10-14 15-19 20-29 30-49 50-69 70+ <NA>
## 458 83 88 71 71 85 51 3 0 6Notable changes to the ggplot command are:
facet_wrap(facet~.), and ncol = 1You may also need to adjust the width and height of the save plot image (see ggsave())
ggplot(central_data2, aes(x = date_onset)) +
# actual epicurves by group
geom_histogram(
aes(group = age_cat, fill = age_cat), # arguments inside aes() apply by group
color = "black", # arguments outside aes() apply to all data
breaks = bin_breaks)+ # see breaks defined above
# Labels on x-axis
scale_x_date(expand = c(0,0), # remove excess x-axis space below and after case bars
date_breaks = "2 months", # labels appear every 2 months
date_minor_breaks = "1 month", # vertical lines appear every 1 month
date_labels = "%b\n'%y")+ # date labels format
# y-axis
scale_y_continuous(expand = c(0,0))+ # removes excess y-axis space between bottom of bars and the labels
# aesthetic themes
theme_minimal()+ # a set of themes to simplify plot
theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
axis.title = element_text(face = "bold"),
legend.position = "bottom")+
# create facets
facet_wrap(facet~. , # each plot is one value of facet
ncol = 1)
# labels
labs(title = "Weekly incidence of cases, by age category",
subtitle = "Subtitle",
fill = "Age category", # provide new title for legend
x = "Week of symptom onset",
y = "Weekly incident cases reported",
caption = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))
## $fill
## [1] "Age category"
##
## $x
## [1] "Week of symptom onset"
##
## $y
## [1] "Weekly incident cases reported"
##
## $title
## [1] "Weekly incidence of cases, by age category"
##
## $subtitle
## [1] "Subtitle"
##
## $caption
## n = 458 from Central Hospital; Case onsets range from Thu 01 May 2014 to Tue 28 Apr 2015
## 20 cases missing date of onset and not shown
##
## attr(,"class")
## [1] "labels"Years under months
1) Faceting 2) …
https://stackoverflow.com/questions/44616530/axis-labels-on-two-lines-with-nested-x-variables-year-below-months https://stackoverflow.com/questions/20571306/multi-row-x-axis-labels-in-ggplot-line-chart
Best is to use the aweek package to create a new column, then group_by and summarize,
Get weeks, then plot use aweek to convert dates to weeks, then plot the weekly counts
To aggregate into weeks and show ALL weeks (even ones with no cases), do these two things:
1) use floor_date = T and factor = T in the date2week() command, and
2) use .drop = F in the group_by() command
like this:
central_tab <- central_data %>%
mutate(week = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
group_by(week, .drop=F) %>%
summarize(n = n())
linelist <- linelist %>%
mutate(week = aweek::date2week(date_onset, floor_day = T))
linelist %>%
group_by(week, .drop = F) %>%
summarise(n = n())
## # A tibble: 57 x 2
## week n
## <aweek> <int>
## 1 2014-W15 1
## 2 2014-W16 1
## 3 2014-W17 5
## 4 2014-W18 4
## 5 2014-W19 12
## 6 2014-W20 17
## 7 2014-W21 15
## 8 2014-W22 21
## 9 2014-W23 24
## 10 2014-W24 21
## # ... with 47 more rows# Preparation
#############
# Create epiweek variable. Factor argument automatically includes all weeks in span. Numeric shows just the week number.
linelist$epiweek <- aweek::date2week(linelist$date_onset, factor = TRUE, numeric = TRUE)
# Calculate maximum number of cases in an epiweek, to get the maximum y-axis height (also helps with uniformity in multiple plots)
ymax <- max(summary(factor(linelist$epiweek), maxsum = length(linelist$epiweek)))
# Weekly case counts
###################
plot_weekly <- ggplot(linelist, aes(x = date_onset)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position = "stack", fill = "grey", color = "black") +
# X-axis 21-day labels
scale_x_date(
# Sets date label breaks as every 3 weeks from Monday before the first case
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm=T), by = "3 weeks"),
# axis limits determined by max/min + buffer
limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)),
# displays as date number, then abbreviated month (e.g. 12 Oct)
date_labels = "%d-%b",
# sets origin at (0,0)
expand = c(0,0)) +
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 25),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, on right side)
plot.caption = element_text(hjust = 0, face = "italic")
) +
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "subtitle here")
ggtitle("Epidemic curve")
## $title
## [1] "Epidemic curve"
##
## attr(,"class")
## [1] "labels"
# print
print(plot_weekly)# PARAMETERS
#############
# Maximum y-value for epiweek (this will be larger than necessary because of missing onset dates)
ymax <- max(table(linelist$epiweek))
# Number missing onset_date and cannot be graphed
missing_onset <- nrow(filter(linelist, is.na(date_onset)))
# SETUP - ACTIVE/NON-ACTIVE ZONES
#################################
# List of "active" zones with a case in the date range
active_zones <- unique(linelist$province[which(linelist$date_onset > (data_date - 90))])
active_zones
# Table of active zones and their overall number of cases (for ordering their stacked appearance)
order_table <- linelist %>%
filter(province %in% active_zones) %>%
group_by(province) %>%
summarise(cases = n())
order_table
# Create TRUE/FALSE variable for "active" health zones
linelist$active_zone <- ifelse(linelist$province %in% active_zones, TRUE, FALSE)
# Create list of non-active HZ names for bottom of plot
other_zone_names <- unique(sort(linelist$province[linelist$active_zone == FALSE]))
# Make variable for graph categories, including a level for "non-active" zones
linelist$graph_zone <- factor(case_when(
# Value assignments
# Non-active zones
linelist$active_zone == FALSE ~ "Non-active zones",
# All others are assigned their names, capitalized
TRUE ~ stringr::str_to_title(linelist$province)),
# Order of variable levels
levels = c(
# "Non-active zones" is first level
"Non-active zones",
# Orders active zones by their frequency in linelist, reversed, so most-affected zones are on the BOTTOM of plot
str_to_title(rev(levels(fct_infreq(as.factor(linelist$province[linelist$active_zone == TRUE])))))))
table(linelist$graph_zone, useNA = "ifany")
# COLORS
########
# Number of unique values in graph_zone variable, minus 1 (for non-active, which is added later as grey (#cccccc))
colors_needed <- length(unique(linelist$graph_zone, na.rm=T)) - 1
# List of possible colors (see colorbrewer2.com, qualitative scheme)
colors_linelist = c(#"#cccccc", # first = non-active grey color
"#1b9e77", # turquoise green
"#ff7f00", # orange
"#ffff33", # yellow
"#6a3d9a", # purple
"#b15928", # brown
"#1f78b4", # blue
"#e31a1c", # red,
"#fb9a99", # pink
"#b2df8a", # light green
"#cab2d6", # light purple
"#a6cee3", # light blue
"#fdbf6f", # beige
"#33a02c" # green
)
# Reduce number of colors to only the number needed
colors_linelist <- c("#cccccc", rev(colors_linelist[1:colors_needed]))
# MAKE GRAPH
#############
plot_overall <- ggplot(linelist, aes(x = date_onset, fill = graph_zone)) +
# stacked bars, bined by week (7 days)
stat_bin(binwidth = 7, position = "stack") +
# Fill of bars
scale_fill_manual(values = colors_linelist,
labels = str_to_sentence(levels(factor(linelist$graph_zone)))) +
# X-axis 21-day labels
scale_x_date( # Sets date label breaks as every 3 weeks from Monday before the first case
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "3 weeks"),
limits = c((min(linelist$date_onset, na.rm = T) - 8), (max(linelist$date_onset, na.rm = T) + 8)), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date number, then abbreviated month (e.g. 12 Oct)
expand = c(0,0)) + # sets origin at (0,0)
# Y-axis breaks every 5 cases
scale_y_continuous(breaks = seq(0, ymax, 25),
limits = c(0, ymax),
expand = c(0, 0)) +
# Theme specifications (axis, text, etc.)
theme(plot.title = element_text(size = 20, hjust = 0, face = "bold"), # title size, font, bold
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
plot.caption = element_text(hjust = 0, face = "italic")
) +
# Legend specifications
theme(legend.title = element_blank(), # No legend title
legend.position = c(0.20, 0.85), # placement of legend
legend.background = element_blank(), # legend background
legend.text = element_text(size=12)) + # legend text size
guides(fill = guide_legend(reverse = TRUE, # Orders Non-active zones at end of legend
override.aes = list(size = 0.2),
ncol = 2)) + # Number of legend columns
labs(x = "Week of illness onset",
y = "Number of cases",
subtitle = "Health zones with cases in the last 42 days specified by color",
caption = paste0(nrow(linelist),
" confirmed and probable cases, reported as of ", data_date, ". ",
missing_onset, " cases missing date of onset and not shown.",
"\nNon-active zones include: ", str_to_title(toString(unique(linelist$province[linelist$active_zone == FALSE]))))) +
ggtitle("Epidemic curve by active health zones")
# print
plot_overallOften you do not have linelist data, but instead daily case counts from facilities, districts, etc. You can plot these in an epidemiological curve, but the code will be slightly different.
This section will utilize the counts_data dataset that was imported earlier, in the data preparation section.
Note: The incidence package does not support aggregate data
As before, we must ensure date variables are correctly classified.
# Create epiweek variable
# aweek weeks are also stored as dates, facilitating better display manipulation
count_data$epiweek <- aweek::date2week(count_data$date_hospitalisation, # use the Date variable
week_start = "Monday", # epiweek begins on Monday
floor_day = TRUE, # only display year and week #
factor = TRUE) # expand to include all possible weeksggplot(data = count_data, aes(x = as.Date(epiweek), y = n_cases, group = hospital, fill = hospital))+
geom_bar(stat = "identity")+
# LABELS for x-axis
scale_x_date(date_breaks = "1 month", # displays by month
date_labels = '%b%d\n%Y')+ #labeled by month with year below
# Choose color palette (uses RColorBrewer package)
scale_fill_brewer(palette = "Pastel1")+
# Theme specifications (axis, text, etc.)
theme(
# title
plot.title = element_text(size=20, hjust= 0, face="bold"), # title size, font, bold
# axes
axis.text.x = element_text(angle=0, vjust=0.5, hjust=1),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
axis.line = element_line(colour="black"),
# background
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
# caption (italics, on right side)
plot.caption = element_text(hjust = 0, face = "italic"))+
# labels
labs(x = "Week of report",
y = "Number of cases",
subtitle = "Cases aggregated by week and shown by hospital",
caption = "Data source: XXXXX")+
ggtitle("Epidemic curve of disease X in fictional location")Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.
In ggplot it is very difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.
This uses the linelist dataset
TODO not complete yet
library(reshape2)
# group the data by week, summarize counts by group (gender)
linelist_week <- linelist %>%
mutate(onset_epiweek = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>%
group_by(onset_epiweek) %>%
summarize(num_male = sum(gender == "m"),
num_female = sum(gender == "f"),
pct_male = round(100*(num_male / n())),
med_age = median(as.numeric(age), na.rm=T)
)
# remove pct and melt
linelist_week_melted <- linelist_week %>%
select(-c("pct_male", "med_age")) %>%
melt(id.vars = c("onset_epiweek"))
# merge together (multiple of the same values in week will attach to melted)
linelist_week_melted <- merge(linelist_week_melted,
linelist_week,
by = "onset_epiweek")
second_axis <- ggplot(linelist_week_melted,
aes(x = as.Date(onset_epiweek),
y = value, group = variable,
fill = variable)) +
# bars
geom_bar(stat = "identity")+
# Colors and labels of confirmed/probable
scale_fill_manual(values = c("blue", "red"),
labels = str_to_sentence(levels(factor(linelist_week_melted$variable)))) +
geom_line(mapping = aes(y = pct_male, color = "% male"), size = 0.5) +
scale_color_manual(values = "black")+
scale_y_continuous(sec.axis = sec_axis(~(./sum(linelist_week_melted$value, na.rm = T)*100), name = "name here", breaks = seq(0, 100, 20)))+
# X-axis scale labels (not aggregation, just the labels)
scale_x_date(# Sets date label breaks as every week
breaks = function(x) seq.Date(from = min(linelist$date_onset, na.rm = T), to = max(linelist$date_onset, na.rm = T), by = "1 week"),
limits = c((min(linelist$date_onset, na.rm=T)), (max(linelist$date_onset, na.rm = T))), # axis limits determined by max/min + buffer
date_labels = "%d-%b", # displays as date # then abbreviated month (e.g. 12 Oct)
expand = c(0, 0)) + # sets origin at (0,0)
# Y-scale in breaks, up to the ymax previously defined
scale_y_continuous(breaks = seq(0, 500, 5), limits = c(0, ymax), expand=c(0, 0)) +
# Themes for axes, titles, background, etc.
theme(plot.title = element_text(size=20, hjust=0.5, face="bold"),
axis.text = element_text(size=12),
axis.title = element_text(size=14, face="bold"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
axis.line = element_line(colour = "black"),
axis.text.x = element_text(angle=90, vjust=0.5, hjust=1)) +
# Legend specifications
theme(legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(0.09, 0.98),
legend.background = element_blank(),
legend.text = element_text(size = 12)) +
guides(fill = guide_legend(reverse = TRUE, override.aes = list(size = 0.2))) +
# Axis and caption labels
labs(x = "Week of illness onset",
y = "Number of Cases",
caption = paste(missing_onset,"cases were missing onset date and are not included in the onset graph")) +
# Title
ggtitle("Cases by week of illness onset")
second_axis
# print
print(plot_defined_cats)This tab should stay with the name “Resources”. Links to other online tutorials or resources.